
feat: Add label polymorphism #4388

Merged
merged 40 commits into from
Apr 28, 2022

Conversation


@Marwes Marwes commented Jan 5, 2022

POC to introduce "label polymorphism", which allows type signatures to specify more precisely how they transform records, so that more errors can be determined statically. This works by giving string literals the new label("literal") type instead of just string, which, in conjunction with record fields being allowed to be type variables, lets functions accept a "label" as a parameter that determines which fields they operate on.

  • Allowing multiple labels to be passed in an array? (columns: [L]) Probably needs to treat arrays as heterogeneous lists of some sort so they are typed like ["a", "b"] instead of [string]. Should be a separate issue, implemented later if desired.
  • On the other hand, functions still want to accept a plain dynamic string (as they do today), so unification needs to cope with that. We may need "dynamic" labels for this, { A with 'dynamic: int, 'dynamic: B }, where a record with one (or more!) dynamic labels would accept any field being accessed. We are moving ahead with only statically defined labels; dynamic ones should be rare enough that we can accept any breakage (though we do need to verify that).
  • Some better syntax for specifying/printing labels so it is clear when they differ. Currently the same single, uppercase letter syntax as for type variables is used, like { A: int }, but that makes it impossible to define a record type with the field A. Perhaps add a "A" syntax (same as the record creation/indexing syntax)? See feat: Accept string literals in the fields of a record type #4664
  • Type variables are currently re-numbered during the convert pass, so you can get confusing errors where it says that the field C does not exist even though the field was actually specified with B (as seen below). (This happens on the master branch today as well.) Should be another issue; we may not bother with this.
  • The delayed unification that is currently there is hacky. It is needed to ensure that we see which label B has before we do the record unification. A better way may be to add a constraint on B, like "B must be one of the fields in record X", which can be solved later.
  • Feature flag so we can release and compile code without labels.
  • Need more tests, or confidence that current tests in labels.rs have adequate coverage

Based on #4664
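The proposal's core mechanism can be illustrated with a toy model (hypothetical Python, not the actual compiler code): a string literal gets a label type carrying its value, while a plain dynamic string stays opaque, so only labels can statically pick out a record field.

```python
# Toy sketch (hypothetical, not the Flux implementation): a literal like "a"
# carries a Label type, so the checker can see which field a call selects.
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    name: str        # the literal, e.g. "a" in fill(column: "a", ...)

@dataclass(frozen=True)
class Str:
    pass             # a plain dynamic string: value unknown at check time

def field_selected_by(arg, record_fields):
    """Return the field an argument statically selects, or None if the
    argument is a plain string and the choice must be deferred to runtime."""
    if isinstance(arg, Label):
        if arg.name not in record_fields:
            raise TypeError(f'record has no field "{arg.name}"')
        return arg.name
    return None

print(field_selected_by(Label("a"), {"a": "int", "b": "string"}))  # a
print(field_selected_by(Str(), {"a": "int", "b": "string"}))       # None
```

With labels, a call that names a nonexistent field fails at check time instead of at runtime, which is exactly the class of errors the proposal wants to catch statically.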


@nathanielc nathanielc left a comment


Overall I like the direction this proposal is going. Have a few questions below.

```
{ 'column: A } <=> { b: string, a: int }
```

There may be a consistent way to unify these records in the face of type variables, however an easy workaround would be to delay the unification of records with unknown fields until they have been resolved, at which point they can unify normally. If there is a field that is still unknown when type checking is done we can designate that as an error.
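The delay-until-resolved strategy can be sketched as a small worklist loop (a hypothetical illustration, not the actual checker): constraints whose label variable is still unknown are set aside, retried while progress is being made, and reported as errors if any remain once checking is done.

```python
# Toy sketch (hypothetical): defer record unification until the label variable
# involved has been resolved; leftover unresolved labels become errors.
def solve(constraints, bindings):
    """constraints: list of (label_var, record_field_set) pairs.
    bindings: label_var -> concrete field name, filled in as inference runs."""
    pending = list(constraints)
    progress = True
    while pending and progress:
        progress = False
        still_pending = []
        for var, fields in pending:
            if var in bindings:                      # label now known: unify
                if bindings[var] not in fields:
                    raise TypeError(f'record has no field "{bindings[var]}"')
                progress = True
            else:
                still_pending.append((var, fields))  # defer for another pass
        pending = still_pending
    if pending:  # fields still unknown when type checking is done
        raise TypeError(f"unresolved labels: {[v for v, _ in pending]}")

solve([("B", {"a", "b"})], {"B": "a"})  # resolves and unifies fine
```

The cost is extra passes over the pending set, and constraints that depend on each other have to be retried until a fixed point.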
Contributor


To me, delaying unification until records with unknown fields have been resolved makes the most sense. It's how I reason about this logic in the first place. Are there any drawbacks to delaying? Multiple passes over the semantic graph?

Contributor Author


It adds some complexity, and could possibly give worse error messages (though as long as we take care to store all the context we need for the errors, it shouldn't be an issue).

There may be some performance implications as well, since we need to know/decide when to evaluate these delayed constraints. Constraints could (in theory at least) depend on each other in complex ways, which we would need to untangle to resolve them all.

A bit of a tangent, but Rust's typechecker has a largely equivalent system of "obligations" which it brute-forces when solving them. I tried to improve that with rust-lang/rust#69218, which would make them know more precisely when they would be solvable, but the overhead was too much.

```
func(opt: "a")

// Possible extension where we only allow some specific labels to be passed in
builtin func : (opt: "option1" | "option2") => int
```
Contributor


Is the idea here to make a sort of enum type? For things like the method parameter to the quantile function?

Why would we use labels for this, other than that it's convenient? Meaning I don't see a use case yet where a label used in the context of a column name would be constrained to a specific set.

Contributor Author


Is the idea here to make a sort of enum type? For things like the method parameter to the quantile function?

Yes, at least a limited one. I mostly just extracted it from #4410 where I found some functions which accept only a limited number of strings as an argument (see group and stddev)
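A `"option1" | "option2"` parameter is effectively a membership check on label literals; a minimal sketch (hypothetical names, not Flux's checker):

```python
# Toy sketch (hypothetical): a parameter typed "option1" | "option2" accepts
# only those label literals and rejects anything else at check time.
ALLOWED = {"option1", "option2"}   # the label union from the signature

def check_opt(label):
    if label not in ALLOWED:
        raise TypeError(f'expected one of {sorted(ALLOWED)}, got "{label}"')
    return label

print(check_opt("option1"))  # option1
```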

```
// We should (probably) still allow dynamic strings for backwards compatibility sake so the `string` type
// also implements the `Label` Kind
c = "a" + ""
fill(column: c, value: 0)
```
Contributor


I wonder how often a truly dynamic column name is used? Maybe we can get some of that from the query archive?

Also, what do we consider dynamic? Anything that requires evaluating the semantic graph? What about this case:

```
c = "foo"
fill(column: c, value: 0)
```

Is that considered dynamic or static? It seems like it could be considered static. I would also expect this case to be very common, as a user might have to use the column name in several different functions and so would want to put it in a variable to make changing it easier.

Contributor Author


Yes, that could be considered static in this context (probably harder to force it to be "dynamic" than not).
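That reading, where a binding to a literal stays static, corresponds to letting the literal's label type flow through the variable; a minimal inference sketch (hypothetical, not the actual checker):

```python
# Toy sketch (hypothetical): a variable bound to a string literal keeps the
# literal's label type, so `c = "foo"; fill(column: c, ...)` remains static.
# Only expressions whose value is unknown (e.g. string concatenation) degrade
# to the plain `string` type.
def infer(expr, env):
    kind, *rest = expr
    if kind == "lit":            # "foo"        -> label("foo")
        return ("label", rest[0])
    if kind == "var":            # c            -> whatever c was bound to
        return env[rest[0]]
    if kind == "concat":         # "a" + suffix -> plain string
        return ("string",)
    raise ValueError(kind)

env = {}
env["c"] = infer(("lit", "foo"), env)   # c = "foo"
print(infer(("var", "c"), env))         # ('label', 'foo') -> static
print(infer(("concat",), env))          # ('string',)      -> dynamic
```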


Marwes commented Mar 17, 2022

For just about all functions defined with polymorphic labels in #4410, I can think of a way to redefine them to not require polymorphic labels, which may be an argument for not doing this.

```
builtin columns : (<-tables: [A], ?column: 'label) => [{ 'label: string }] where A: Record
// into; if we want a different name on the output, that is a `map(fn: fn(r) => { mycolumn: r.value })`
builtin columns : (<-tables: [A]) => [{ value: string }] where A: Record

builtin count : (<-tables: [A], ?column: string) => [{ _value: int }] where A: Record
// A `map(fn: fn(r) => r.mycolumn)` before `count` would let us pick the column
builtin count : (<-tables: [A]) => [{ _value: int }]

builtin distinct : (<-tables: [{ A with 'column: B }], ?column: 'column) => [{ _value: B }] where A: Record, B: Equatable
// into
builtin distinct : (<-tables: [A]) => [B] where A: Equatable

// Can't write out a type without polymorphic labels as `A` is preserved in the output,
// but users can write it manually as `map(fn: fn(r) => ({ r with as: r.column }))` (vs `duplicate(column: "column", as: "as")`)
builtin duplicate : (<-tables: [{ A with 'column: B }], column: 'column, as: 'as) => [{ A with 'column: B, 'as: B }]
    where A: Record

// Same as `columns` and `count`. `map` before and after can "rename" the columns to and from the format we want
builtin elapsed : (<-tables: [{ A with 'time: time }], ?unit: duration, ?timeColumn: 'time, ?columnName: 'column) => [{ A with 'time: time, 'column: duration }]
    where A: Record
```

I didn't go through all of the functions thoroughly, but other than issues implementing these changed functions, it seems pretty plausible to "only" provide functions that operate on streams of plain values instead of records (or on records with specific fields, using map to go between).

Lists of polymorphic labels (which is not part of this PR) seem more complicated but at least some operations could be defined.

```
// map(fn: fn(r) => ({ column_1: r.column_1, column_2: r.column_2 })) // etc.
builtin keep : (<-tables: [{ A with ['column]: C }], ?columns: ['column], ?fn: (column: string) => bool) => [B] where A: Record, B: Record

// Can't be done
builtin drop : (<-tables: [{ A with ['column]: B }], ?fn: (column: string) => bool, ?columns: ['column]) => [{ C with  }]

builtin covariance : (<-tables: [{ A with ['column]: B }], ?pearsonr: bool, ?valueDst: 'dst, columns: ['column]) => [{ A with 'dst: B }]
    where A: Record
// Specify two specific fields in the signature and `map` to them
builtin covariance : (<-tables: [{ A with column1: B, column2: B }], ?pearsonr: bool) => [{ A with dst: B }]
    where A: Record

// Can't be translated, we rely on the other fields (`A`) to be preserved
builtin sort : (<-tables: [{ A with ['column]: B }], ?columns: ['column], ?desc: bool) => [{ A with ['column]: B }]
    where A: Record,
          B: Comparable
```


wolffcm commented Mar 17, 2022

@Marwes

In general I like the idea of using smaller composable operations, and especially map since we are making it more performant now. I have some concerns though.

For this example with aggregate functions:

```
builtin count : (<-tables: [A], ?column: string) => [{ _value: int }] where A: Record
// A `map(fn: fn(r) => r.mycolumn)` before `count` would let us pick the column
builtin count : (<-tables: [A]) => [{ _value: int }]
```

How does this work for a typical example where there are group key columns? So say something like

```
import "array"

array.from(rows: [
    {host: "A", region: "east", mem: 100},
    {host: "B", region: "east", mem: 100},
    {host: "C", region: "west", mem: 100},
    {host: "D", region: "west", mem: 100},
])
    |> group(columns: ["region"])
    |> sum()
```

For this example I want to get the sum of memory used by region. How do we preserve the separation between regions if aggregates always produce [{_value: int}]?

A broader question, are you thinking that users will not have to change their scripts in order to get the same behavior as before? Or are you assuming that some changes will be required of users?


Marwes commented Mar 18, 2022

For this example I want to get the sum of memory used by region. How do we preserve the separation between regions if aggregates always produce [{_value: int}]?

True, aggregators need to take groups into account, and the group concept is sort of shared between records and streams: the stream tracks each group, but the group key of a record is accessed through the record itself. So returning or accepting streams of raw values is probably not that useful.

I suppose there could be some way to try to separate group keys from records, but that likely strays too far towards breaking changes, and it may just end up replicating what records already do. So I think we need to stick with streams of records, which in turn forces aggregate functions (and other transformations) to accept which column(s) to work with as a parameter.

A broader question, are you thinking that users will not have to change their scripts in order to get the same behavior as before? Or are you assuming that some changes will be required of users?

No, users would be able to keep everything as is, but when working with pivoted data they would need to add a map to select the column to work with. However, given that aggregators expect groups to be preserved, we would need to do something like map(fn: fn(r) => { r with value: r.mycolumn }) to preserve the group key (I think), which doesn't seem great.

@Marwes Marwes force-pushed the labels branch 6 times, most recently from c70fb56 to 848cd8e Compare March 31, 2022 16:41
Marwes pushed a commit that referenced this pull request Apr 11, 2022
For #4388 we need a way to describe a type (label) variable in place of a record field.
As type variables are described as a single, uppercase letter, mimicking that syntax will make it
impossible to describe the literal field name "A", "B", etc. By allowing a string literal in this location one can always use this syntax to describe these single letter field names and also describe any other fields which do not fit into an `identifier`, for example `"field with spaces"`, `"symbols#%^"`.
Marwes pushed a commit that referenced this pull request Apr 11, 2022
Marwes pushed a commit that referenced this pull request Apr 12, 2022
@Marwes Marwes force-pushed the labels branch 3 times, most recently from cddf3bf to 98a3984 Compare April 14, 2022 12:36
Marwes pushed a commit that referenced this pull request Apr 20, 2022
@Marwes Marwes force-pushed the labels branch 3 times, most recently from 78e85ac to 70914d2 Compare April 20, 2022 13:08
Markus Westerlind and others added 17 commits April 28, 2022 13:10
Co-authored-by: Scott Anderson <[email protected]>
<error> isn't the best in this error message but it isn't worse than the previous error
These were causing errors due to the `experimental/universe` imports: accessing, say, `universe.fill` would instantiate all the functions in the `universe` record, but only the labels of the used function would get filled in, which then triggered the error.

I thought this was necessary to prevent ambiguities from arising when doing unifications
like

```
{ A with B: string } <=> { C with test: string }
{ A with B: string } <=> { D with abc: string }
```

as unifying the labels here (`B <=> "test"/"abc"`) would give different results depending on which unification was done first. However, I was able to remove that code, so record labels that never become concrete just hang around unused, which at least does not cause any problems.
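The order dependence mentioned in the commit message can be made concrete with a toy sketch (hypothetical, not the actual unifier): unifying the label variable `B` against two records commits it to whichever field is tried first, so the two orders yield different commitments.

```python
# Toy sketch (hypothetical): unifying a label variable against a record with a
# free row variable commits the label to some field of that record, so the
# outcome depends on which unification happens first.
def unify_label(var, record_fields, bindings):
    if var in bindings:
        if bindings[var] not in record_fields:
            # Already committed elsewhere; this record lacks the field, so its
            # row variable would have to absorb it instead.
            return f"row absorbs {bindings[var]!r}"
        return f"matched {bindings[var]!r}"
    bindings[var] = next(iter(sorted(record_fields)))  # commit to a field
    return f"committed to {bindings[var]!r}"

order1 = {}
print(unify_label("B", {"test"}, order1))  # committed to 'test'
print(unify_label("B", {"abc"}, order1))   # row absorbs 'test'

order2 = {}
print(unify_label("B", {"abc"}, order2))   # committed to 'abc'
print(unify_label("B", {"test"}, order2))  # row absorbs 'abc'
```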
5 participants